A Translation System for Enabling Data Mining Applications on GPUs

Wenjing Ma, Gagan Agrawal

The Ohio State University

ICS 2009


Motivation and Overview

• Two Popular Trends

– Data-intensive computing

– GPU programming

• Seems like a good match

• Can we ease the use of GPGPUs?

– Domain-specific Programming Tool

– Can exploit common programming structure

– Enable good speedups


Context

• Many years of work on compiler and runtime support for data-intensive applications

– Clusters, SMPs, clusters of SMPs

– FREERIDE and language front-ends

• Similar to map-reduce, but ...

– Predates it and performs better!

• Recent work on

– (Clusters of) multi-cores, incorporating RSTM

– GPUs: C and Matlab front-ends

– Clusters of GPUs, multi-core and GPUs


Outline

• Background

– GPU Computing

– Parallel Data Mining

• Challenges of Data Mining on GPU

• Architecture of the System

– Sequential code analysis

– Generation of CUDA programs

– Optimization Techniques

• Experimental Results

– k-means, EM, PCA

• Related and future work


Background - GPU Computing

• Many-core architectures/accelerators are becoming more popular

• GPUs are inexpensive and fast

• CUDA is a high-level language for GPU programming


CUDA Programming

• Significant improvement over use of graphics libraries

• But ...

– Need detailed knowledge of the GPU architecture and a new language

– Must specify the grid configuration

– Deal with memory allocation and movement

– Explicit management of memory hierarchy


Parallel Data Mining

• Common structure of data mining applications (FREERIDE)

  /* outer sequential loop */
  while () {
    /* Reduction loop */
    Foreach (element e) {
      (i, val) = process(e);
      Reduc(i) = Reduc(i) op val;
    }
  }
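The reduction structure above can be made concrete with a small example. The following is a minimal Python sketch using one-dimensional k-means, where `op` is addition on (sum, count) pairs. The function names `process`, `reduction_pass`, and `kmeans_1d`, as well as the sample data, are illustrative assumptions and not part of FREERIDE or the translation system.

```python
# Sketch of the generalized reduction structure (all names are illustrative).

def process(e, centers):
    """Map one element to (i, val): index of the nearest center and its contribution."""
    i = min(range(len(centers)), key=lambda c: abs(e - centers[c]))
    return i, (e, 1)  # contribution to (sum, count)

def reduction_pass(data, centers):
    """Reduction loop: Reduc(i) = Reduc(i) op val, with op = element-wise +."""
    reduc = [(0.0, 0) for _ in centers]   # the reduction object
    for e in data:                        # Foreach (element e)
        i, (s, n) = process(e, centers)
        reduc[i] = (reduc[i][0] + s, reduc[i][1] + n)
    return reduc

def kmeans_1d(data, centers, iterations=10):
    """Outer sequential loop: run a reduction pass, then update the centers."""
    for _ in range(iterations):
        reduc = reduction_pass(data, centers)
        centers = [s / n if n else c for (s, n), c in zip(reduc, centers)]
    return centers

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(data, [0.0, 5.0])
```

Only the reduction loop touches every element, which is why it is the part worth moving to the GPU; the outer loop stays sequential.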


Porting on GPUs

• High-level parallelization is straightforward

• Details of data movement

• Impact of thread count on reduction time

• Use of shared memory


Architecture of the System

[Figure: system architecture]

User input: variable information, reduction functions, optional functions

→ Code Analyzer (in LLVM) and Variable Analyzer

→ Variable access pattern and combination operations

→ Code Generator

→ Host program (grid configuration and kernel invocation) and kernel functions

→ Executable


User Input

A sequential reduction function

Optional functions (initialization function, combination function…)

Values of each variable or size of array

Variables to be used in the reduction function


Analysis of Sequential Code

• Get the access features of each variable

• Determine the data to be replicated

• Get the operator for global combination

• Identify variables for shared memory


Memory Allocation and Copy

[Figure: items labeled A, B, and C, accessed by threads T0, T1, ..., T63 of successive blocks]

• Need copy for each thread

• Copy the updates back to host memory after the kernel reduction function returns
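The per-thread copies and the copy-back step on this slide can be illustrated on the host. The following Python sketch simulates 64 threads (matching T0 to T63 in the figure), each updating a private replica of the reduction object, followed by a combination step. The histogram workload, the strided element assignment, and the names `kernel` and `combine` are assumptions made for illustration only.

```python
# Sketch: each simulated thread updates a private copy of the reduction
# object; the copies are combined after the "kernel" returns.
# The histogram workload and all names are illustrative assumptions.

NUM_THREADS = 64  # matches T0..T63 in the figure

def kernel(data, num_bins, num_threads=NUM_THREADS):
    # One private replica per thread, so no two threads write the same slot.
    replicas = [[0] * num_bins for _ in range(num_threads)]
    for tid in range(num_threads):
        # Elements handled by thread tid (strided assignment, an assumption).
        for idx in range(tid, len(data), num_threads):
            replicas[tid][data[idx] % num_bins] += 1
    return replicas

def combine(replicas):
    # Global combination: element-wise sum over all per-thread replicas.
    return [sum(column) for column in zip(*replicas)]

data = list(range(1000))
result = combine(kernel(data, num_bins=4))
```

Replication avoids write conflicts between threads at the cost of extra memory, which is why the size of the reduction object matters for the shared-memory decisions later in the talk.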


Extract information of variable access

[Figure: variable analyzer]

Inputs: IR from LLVM, argument list, user input

→ Variable analyzer

→ Extract variables to be written, read-only variables, and temporary variables
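The three-way classification produced by the variable analyzer can be sketched as follows. This Python example works on a toy trace of `(op, name)` pairs, which is a hypothetical stand-in for LLVM IR load and store instructions. Treating every non-argument variable as temporary is also an assumption, not a rule stated on the slide.

```python
# Sketch: classify variables by access pattern from a toy access trace.
# The (op, name) trace format is hypothetical; the real analyzer reads LLVM IR.

def classify(accesses, arguments):
    """Split variables into written, read-only, and temporary sets."""
    stored = {name for op, name in accesses if op == "store"}
    loaded = {name for op, name in accesses if op == "load"}
    args = set(arguments)
    written = stored & args                # arguments updated by the function
    read_only = (loaded - stored) & args   # arguments that are only read
    temporary = (stored | loaded) - args   # variables local to the function
    return written, read_only, temporary

# Toy trace for: tmp = data[i] - centers[k]; update[k] += tmp
trace = [("load", "data"), ("load", "centers"), ("store", "tmp"),
         ("load", "tmp"), ("load", "update"), ("store", "update")]
written, read_only, temporary = classify(trace, ["data", "centers", "update"])
```

Written variables are the candidates for per-thread replication and copy-back, while read-only variables only need a host-to-device copy.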


Generating CUDA Code and C++/C code Invoking the Kernel Function

• Memory allocation and copy

• Thread grid configuration (block number and thread number)

• Global function

• Kernel reduction function

• Global combination


Kernel Reduction Function

• Generated out of the original sequential code

• Divide the main loop by block_number and thread_number

• Replace the access offsets with appropriate indices
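One way to picture the loop division is to compute, for each (block, thread) pair, which iterations of the original loop it executes. The Python sketch below assumes a strided (cyclic) assignment; the slide does not say whether the generated code uses a strided or a blocked split, so this choice and the helper names are assumptions.

```python
# Sketch: divide the main loop across blocks and threads.
# The strided assignment is an assumption; a blocked split would also fit.

def thread_indices(n, block_number, thread_number, block_id, thread_id):
    """Loop indices handled by one thread of one block."""
    global_id = block_id * thread_number + thread_id
    total_threads = block_number * thread_number
    return range(global_id, n, total_threads)

def parallel_sum(data, block_number, thread_number):
    # Per-thread partial results, followed by a global combination.
    partial = [[0] * thread_number for _ in range(block_number)]
    for b in range(block_number):
        for t in range(thread_number):
            for i in thread_indices(len(data), block_number, thread_number, b, t):
                partial[b][t] += data[i]
    return sum(sum(row) for row in partial)

data = list(range(100))
total = parallel_sum(data, block_number=2, thread_number=4)
```

Every index of the original loop is visited by exactly one thread, so the combined result equals the sequential one whenever the reduction operator is associative and commutative.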


Optimizations

• Using shared memory

• Providing user-specified initialization functions and combination functions

• Specifying variables that are allocated once


Dealing with Shared memory

• Size = length * sizeof(type) * thread_info

– length: size of the array

– type: char, int, and float

– thread_info: whether it's copied to each thread

• Mark each array as shared until the size exceeds the limit of shared memory
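The sizing rule can be worked through numerically. The Python sketch below assumes that `thread_info` equals the number of threads when an array is copied to each thread and 1 otherwise. The 16 KB limit, the type sizes, and the array list are illustrative values, not figures taken from the slides.

```python
# Sketch of the sizing rule: Size = length * sizeof(type) * thread_info.
# Assumptions: thread_info is the thread count for per-thread copies and 1
# otherwise; the 16 KB limit and the arrays are illustrative.

SIZEOF = {"char": 1, "int": 4, "float": 4}

def array_size(length, type_name, per_thread, num_threads):
    thread_info = num_threads if per_thread else 1
    return length * SIZEOF[type_name] * thread_info

def mark_shared(arrays, num_threads, limit=16 * 1024):
    """Mark arrays as shared, in order, until the next one would exceed the limit."""
    shared, used = [], 0
    for name, length, type_name, per_thread in arrays:
        size = array_size(length, type_name, per_thread, num_threads)
        if used + size > limit:
            break
        shared.append(name)
        used += size
    return shared, used

arrays = [("centers", 256, "float", False),  # 256 * 4 * 1  = 1024 bytes
          ("update", 32, "float", True),     # 32 * 4 * 64  = 8192 bytes
          ("counts", 32, "int", True)]       # 8192 bytes, no longer fits
shared, used = mark_shared(arrays, num_threads=64)
```

Because replicated arrays are multiplied by the thread count, even short arrays can exhaust shared memory quickly, which ties the choice of thread count to what fits.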


Shared memory layout Strategies

• No sorting

• Greedy sorting

• Write-first sorting
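The slides name the three strategies without defining them. The Python sketch below shows one plausible reading, stated here as an assumption: no sorting keeps the original order, greedy sorting places the smallest arrays first so that more of them fit, and write-first sorting gives priority to arrays that are written. The array sizes, the limit, and the function names are hypothetical.

```python
# Sketch of three orderings for choosing which arrays go to shared memory.
# The slides only name the strategies; the orderings here are assumptions.

def place(arrays, limit):
    """Take arrays in the given order until one no longer fits."""
    chosen, used = [], 0
    for name, size, written in arrays:
        if used + size > limit:
            break
        chosen.append(name)
        used += size
    return chosen

def no_sorting(arrays, limit):
    return place(arrays, limit)  # original declaration order

def greedy_sorting(arrays, limit):
    return place(sorted(arrays, key=lambda a: a[1]), limit)  # smallest first

def write_first_sorting(arrays, limit):
    # Written arrays first, then smaller arrays first within each group.
    return place(sorted(arrays, key=lambda a: (not a[2], a[1])), limit)

# (name, size in bytes, written?)
arrays = [("A", 6000, False), ("B", 2000, False),
          ("C", 9000, True), ("D", 1000, True)]
```

With a 12000-byte limit, these orderings select different subsets of the same four arrays, which is the trade-off the three strategies explore.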


No sorting

[Figure: arrays A, B, C, and D and their placement in shared memory]


Greedy sorting

[Figure: arrays A, B, C, and D reordered before placement in shared memory]


Other Optimizations

• Reducing memory allocation and copy overhead

– Arrays shared by multiple iterations can be allocated and copied only once

• User-defined combination function
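The saving from allocating and copying once can be estimated with simple bookkeeping. The Python sketch below counts host-to-device copies over the outer sequential loop; the array names, the iteration count, and the `count_copies` helper are hypothetical.

```python
# Sketch: count host-to-device copies with and without allocate-once.
# The bookkeeping and the array names are hypothetical.

def count_copies(arrays, iterations, allocate_once):
    """arrays maps name -> True if the array is unchanged across iterations."""
    copies = 0
    for name, loop_invariant in arrays.items():
        if allocate_once and loop_invariant:
            copies += 1            # allocated and copied once, before the loop
        else:
            copies += iterations   # copied again on every outer-loop iteration
    return copies

arrays = {"data": True, "centers": False}
naive = count_copies(arrays, iterations=10, allocate_once=False)
hoisted = count_copies(arrays, iterations=10, allocate_once=True)
```

In this toy setting the input data set is copied once instead of ten times, and it is typically the largest array involved.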


Applications

• K-means clustering

• EM clustering

• PCA


Experimental Results

Speedup of k-means


Speedup of EM


Speedup of PCA


Related Work

• OpenMP to CUDA (Purdue)

• Domain-specific operators to CUDA (NEC)

• CUDA-lite etc. (Illinois)

• Various application studies


Conclusions

• Automatic CUDA Code Generation and Optimization is feasible

• Restricting to domain / communication style helps

• Interesting new compiler optimizations