Sep 11, 2009
Automatic Transformation of Applications onto GPUs and GPU
Clusters
PhD Candidacy presentation: Wenjing Ma
Advisor: Dr. Gagan Agrawal
The Ohio State University
Outline of Contents
• Introduction
  • Accelerators, GPGPU and GPU clusters
  • Difficulty of GPU programming
• Current Work
  • Translation system for enabling data mining applications on GPUs
  • Automatic translation of data mining applications from MATLAB to GPUs
  • Automatic code generation for data mining on clusters with GPU support
• Future Work
  • Optimize shared memory usage
  • Support more categories of MATLAB programs
  • Extend the FREERIDE framework
Introduction
• Accelerators, GPGPU and GPU clusters
  – Multi-core architectures are increasingly popular in high-performance computing
  – Examples: GPUs, the Cell processor, FPGAs
  – GPUs have a good performance/price ratio
• Difficulty of programming
  – How do we program a cluster with accelerators on each node?
Our Approach
• Provide high-level support for programming emerging high-end configurations
  • High-level language as input
  • Target code for particular devices as output
• Focus on specific application classes
  • Simplifies code generation
Specific Implementations

[Diagram: input languages (C/C++, MATLAB, …) are mapped to target platforms (GPU, GPU cluster, …).]
Translation System for Enabling Data Mining Applications on GPUs
• Focus on data mining applications
  • Common structure
• Translation system
  • Simple user input
  • Optimization for GPU execution
Complications of GPU Programming
• The user must thoroughly understand the GPU architecture and the programming model of the language
• Must deal with memory allocation
• Must develop and optimize strategies for data movement between memory hierarchies (for example, the shared memory on the GPU)
Parallel Data Mining
• Common structure of data mining applications (adapted from FREERIDE):

  while () {
      /* Reduction loop */
      foreach (element e) {
          (i, val) = process(e);
          Reduc(i) = Reduc(i) op val;
      }
      Finalize();
  }
Basic Idea of Data Mining on GPU
• Parallel shared-memory programming
• 3 steps involved in the main loop
  • Reading data
  • Computing update
  • Writing back update
GREENRIDE: Translation System for Enabling Data Mining Applications on GPUs
• User input
• Code analyzer
  • Analysis of variables (variable type and size)
  • Analysis of reduction functions (sequential code from the user)
• Code generator (generates the CUDA code and the C++ code invoking the kernel function)
• Optimization
Architecture of the System

[Diagram: user input (variable information, reduction functions, optional functions) feeds the Code Analyzer (in LLVM) and the Variable Analyzer; these pass variable access patterns and combination operations to the Code Generator, which emits the host program (data copy and thread grid configuration) and the kernel functions, compiled together into the executable.]
User Input
• A sequential reduction function
• Optional functions (initialization function, combination function, …)
• Values of variables / sizes of arrays
• Variables to be used in the reduction function
Analysis of Sequential Code
• Identify the parallel loop
  • One major loop is parallelized
• Get the access features of each variable
• Figure out the data to be replicated
• Get the operator for global combination
• Calculate the size of shared memory to use and which data to keep in shared memory
Get the Access Features for Each Variable
• Use LLVM to generate the intermediate representation
• Analyze the intermediate representation and collect information for each pointer
  • Pointers used in "store" operations are output pointers
  • The other pointers in the argument list are input variables
  • Pointers that do not appear in the argument list are temporary storage
Data to be Replicated
• Variables that need to be replicated for each thread
  • Variables to be written
  • Different threads may access the same object
  • Need combination at the end
Generate CUDA and C++/C Code
• Revisit the common structure of data mining applications (adapted from FREERIDE):

  while () {
      /* Reduction loop */
      foreach (element e) {
          (i, val) = process(e);
          Reduc(i) = Reduc(i) op val;
      }
      Finalize();
  }
Generate CUDA and C++/C Code
• Memory allocation and copy
• Reduction functions and global combination
• Optimization
  • Using shared memory on the GPU
Memory Allocation and Copy
• Device memory allocation, copying between memory hierarchies, and replication for threads
• Variables that are written need a replica for each thread

[Diagram: arrays A and B laid out with per-thread copies for threads T0, T1, …, T63.]
Reduction Functions
• Global function
  • Invokes the kernel reduction and the global combination function
  • Invokes user-specified initialization and combination functions, if any
• Kernel reduction function
  • Generated out of the original sequential code
  • Divides the main loop by block number and thread number
  • Rewrites the access indices
  • …
• Global combination
Program Structure of the Generated CUDA Code
• Host program:

  Copy_data(HOST_TO_DEVICE);
  Device_reduct();
  Copy_data(DEVICE_TO_HOST);
  Combination();

• Device function:

  Device_reduct()
  {
      Reduction();
      __syncthreads();
      Combination();
      __syncthreads();
  }
Optimizations
• Use shared memory on the GPU (almost a "must"!)
• If provided, use user-specified initialization functions and combination functions
• Specify variables that are allocated once and reused by multiple invocations of the kernel
• …
Deal with Shared Memory
• Use simple heuristics
• Size = length * sizeof(type) * thread_info
  – length: size of the array
  – type: char, int, or float
  – thread_info: whether it is copied to each thread
• Sort the arrays according to Size
• Mark each array as shared until the size exceeds the limit of shared memory
Other Optimizations
• Reduce memory allocation and copy overhead
  • Arrays shared by multiple iterations can be allocated and copied only once
• User-defined combination function
  • Possibly reduces the amount of data to combine
Experiments
• Applications: popular data mining applications
  • K-means clustering
  • EM clustering
  • PCA
• Compare the performance of manual CUDA code and automatically generated code
• The effect of optimization, especially the use of shared memory
Experiment Results (K-means)
Results on GeForce 8800GTX and GeForce 9800GX2

[Charts: speedup over the CPU sequential version vs. thread_number * block_number (64*1 to 256*256), for manual-computing, manual-computing with copy, automatic-computing, and automatic-computing with copy.]
Experiment Results (EM)
[Chart: speedup over the CPU sequential version vs. thread_number * block_number (64*1 to 256*256), for manual-computing, manual-computing with copy, automatic-computing, and automatic-computing with copy.]
Experiment Results (PCA)
[Chart: speedup over the CPU sequential version vs. thread_number * block_number (64*1 to 128*64), for computing, computing with copy, optimized computing, and optimized computing with copy.]
GMAT-DM: Automatic Transformation from MATLAB for GPUs

[Diagram: MATLAB code → OCTAVE parser → GMAT-DM → C code → GREENRIDE → CUDA code]
Main Issues in GMAT-DM
• Matrix manipulation
  • Inner representation as 1-D arrays
  • Type and shape information based on user input
• Basic problem: matrix operation translation
  • Matrix multiplication chain
• Function combination
• Other optimizations
Matrix Multiplication Chain

A * B * C, with A: 6x100, B: 100x2, C: 2x1

1. (A*B)*C — instruction count: 1212; intermediate storage: 12
2. A*(B*C) — instruction count: 800; intermediate storage: 100

Use weighted metrics to determine the multiplication chain order.
Function Combination

Before combination (two separate loops; note z and g must be initialized to MAX_VALUE and 0 respectively for the comparison to work):

  for (i = 0; i < maxRow; i++)
      for (j = 0; j < k; j++) {
          /* code to compute distances */
          d[i][j] = sqrt(distances);
      }
  for (int i1 = 0; i1 < maxRow; i1++) {
      z[i1] = MAX_VALUE;
      g[i1] = 0;
      for (int i2 = 0; i2 < k; i2++) {
          if (z[i1] > d[i1][i2]) {
              z[i1] = d[i1][i2];
              g[i1] = i2;
          }
      }
  }

After combination (the intermediate array d becomes scalars):

  float temp_0, temp_z;
  int temp_g;
  for (i = 0; i < maxRow; i++) {
      temp_z = MAX_VALUE;
      temp_g = 0;
      for (j = 0; j < k; j++) {
          /* code to compute distances */
          temp_0 = sqrt(distances);
          if (temp_z > temp_0) {
              temp_z = temp_0;
              temp_g = j;
          }
      }
      z[i] = temp_z;
      g[i] = temp_g;
  }
Other Optimizations
• Avoid memory allocation overhead: use index transformation
  • Row extraction
  • Matrix transposition
• Rename when a single element is accessed through many iterations
Experiment Results (K-means)
[Chart: speedup vs. thread_number * block_number (64*1 to 256*256). Series: C, unoptimized, optimized, C with copy, unoptimized with copy, optimized with copy.]
Experiment Results (EM)
E phase and M phase

[Charts: speedup vs. thread_number * block_number (64*1 to 256*256). E-phase series: C, unoptimized, optimized, optimized-with-copy. M-phase series: C, unoptimized, optimized-1, optimized-2, optimized-2-with-copy.]
Experiment Results (PCA)
[Chart: speedup vs. thread_number * block_number (128*1 to 128*64). Series: MATLAB, C, MATLAB-with copy, C-with copy.]
AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support

Revisit the common structure of data mining applications (adapted from FREERIDE); the reduction loop is offloaded to the GPU:

  while () {
      /* Reduction loop */
      foreach (element e) {
          (i, val) = process(e);
          Reduc(i) = Reduc(i) op val;
      }
      Finalize();
  }
Code Generation System
Experiment Results (K-means)
[Charts: results on a 1.5 GB data set and a 3 GB data set.]
Experiment Results (PCA)
[Charts: 64M rows with 4 components, and 2M rows with 64 components.]
Optimize Shared Memory Usage
• Use constant and texture memory
  • Cached and read-only
• Integer programming model
• Loop transformation
Integer Programming Model

Variables: x[i][j], 0 ≤ i ≤ m, 0 ≤ j ≤ n: whether variable i is kept in shared memory at loop j

Objective function: maximize
  z = Σ_{i ∈ 1..m, j ∈ 1..n} (Acc[i][j] − 1) * x[i][j] * Wp[i][j]

Constraints:
  For i in 1..m: Σ_{j ∈ 1..n} x[i][j] * W[i][j] * Sizep[i][j] ≤ S
  For j in 1..n, for i in 1..nLive[j]:
    x[live[j][i].start][j] = x[live[j][i].start + 1][j] = … = x[live[j][i].end][j]
Loop Transformation
• Loop fusion
  – Reduces intermediate memory
  – For loops with the same size, where variable[k] is temporary storage:

  Loop 1 (size[1] = n): Active[1][k] = 1 …
  Loop 2 (size[2] = n): Active[2][k] = 1 …
Loop Transformation
• Loop fission
  – Fits more data in shared memory
  – Should not violate data dependences

  for (int r = 0; r < n; r++) {       // Loop 1
      ...
      for (int c = 0; c < m; c++)     // Loop 2
          i[c] ...;
      for (int c = 0; c < m; c++)     // Loop 3
          j[c] ...;
      ...
  }
Loop Transformation
• Loop switching
  – Sort all the loops in decreasing order of iteration counts
  – Enables better usage of shared memory
  – Should not violate dependences
Support More Categories of MATLAB Programs
• Include library functions and determine thread grid configuration
• Deal with sparse matrix computation
• Investigate basic functions in the SPM library and parallelization of ICA
Include Library Functions and Determine Thread Grid Configuration
• 2-D blocking or 1-D blocking
• Use library functions
  • Depends on the input data
  • Might need separate kernels to use different configurations
  • Needs a model to evaluate the overhead of multiple kernels
Deal with Sparse Matrix Computation
• Represent sparse matrices on the GPU
  • Coordinate representation
  • Diagonal representation (easy to combine with the computing context)
  • Might use Sparse Accelerator (evaluate efficiency of shared memory usage)
• Representation conversion
  • Better done on the CPU
Investigate Basic Functions in the SPM Library and Parallelization of ICA
• Enrich GMAT-DM
  • Provide GPU support for more functions and operations in MATLAB, especially sparse matrix computation
• Study possible optimizations to communication and synchronization for SPM and ICA
  • For clusters or GPUs, depending on my progress and time
Extend FREERIDE Framework
• Convert MATLAB programs into C++ and CUDA code in the FREERIDE framework
  – Based on GMAT-DM and AUTO-GC
• Cost model to determine device configuration:

  cost[i] = (mem[i] * LATENCY + comp[i] + shared[i]) * Iter[i] + shared[i] * LATENCY
  total_cost = Σ_{i=1..m} cost[i] / number_iterations
  cost_combine = Σ_{i ∈ output} size[i] * total_threads
  If total_cost + cost_combine > T then use GPU, else use CPU
Conclusion
• Contributions
  – GREENRIDE: translating data mining applications onto GPUs
  – GMAT-DM: automatic translation of data mining applications from MATLAB to GPUs
  – AUTO-GC: automatic code generation for FREERIDE with GPU support
• Proposed work
  – Optimize shared memory usage
  – Support more categories of MATLAB programs
  – Extend the FREERIDE framework
Sep 11, 2009
Thank you!
• Questions?